Towards Geometric Understanding of Motion
The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Therefore, classical optical flow methods try to model this geometry to solve for the motion. However, recent deep learning methods take a completely different approach. They try to predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about the structure from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.
The spatial structure in images can be captured by representing it at multiple scales. To represent multiple scales of images in deep neural nets, we introduce a Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation. SPyNet achieves a 97% reduction in model parameters over previous methods and is more accurate.
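To make the coarse-to-fine residual idea concrete, here is a minimal PyTorch sketch, assuming level_nets is a list of small per-level CNNs that each take the first image, the second image warped by the current flow, and the flow itself (8 input channels for RGB); the helper names and sizes are our illustration, not the released SPyNet code:

    import torch
    import torch.nn.functional as F

    def warp(img, flow):
        """Backward-warp img (B, C, H, W) by flow (B, 2, H, W), in pixels."""
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                                torch.arange(w, device=img.device), indexing="ij")
        gx = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0   # normalize x to [-1, 1]
        gy = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0   # normalize y to [-1, 1]
        return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

    def coarse_to_fine_flow(level_nets, img1, img2):
        """Each level predicts only a residual on the upsampled coarser flow,
        so every network solves a small-motion problem."""
        pyr1, pyr2 = [img1], [img2]
        for _ in range(len(level_nets) - 1):           # build image pyramids
            pyr1.insert(0, F.avg_pool2d(pyr1[0], 2))
            pyr2.insert(0, F.avg_pool2d(pyr2[0], 2))
        flow = torch.zeros(img1.shape[0], 2, *pyr1[0].shape[-2:], device=img1.device)
        for net, i1, i2 in zip(level_nets, pyr1, pyr2):
            if flow.shape[-2:] != i1.shape[-2:]:       # upsample and rescale flow
                flow = 2.0 * F.interpolate(flow, size=i1.shape[-2:],
                                           mode="bilinear", align_corners=True)
            flow = flow + net(torch.cat([i1, warp(i2, flow), flow], dim=1))
        return flow

Because each per-level network only refines a small residual, it can stay tiny, which is where the large parameter reduction comes from.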
The spatial structure of the world extends to people and their motion. Humans have a very well-defined structure, and this information is useful in estimating optical flow for humans. To leverage this information, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see significant improvement in the ability of the networks to estimate human optical flow.
The structure and geometry of the world affects the motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, where several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. To this end, we show that jointly learning single view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.
Our findings support our hypothesis that explicit constraints on the structure and geometry of the world lead to better methods for motion estimation.
Generating 3D faces using Convolutional Mesh Autoencoders
Learned 3D representations of human faces are useful for computer vision
problems such as 3D face tracking and reconstruction from images, as well as
graphics applications such as character generation and animation. Traditional
models learn a latent representation of a face using linear subspaces or
higher-order tensor generalizations. Due to this linearity, they cannot
capture extreme deformations and non-linear expressions. To address this, we
introduce a versatile model that learns a non-linear representation of a face
using spectral convolutions on a mesh surface. We introduce mesh sampling
operations that enable a hierarchical mesh representation that captures
non-linear variations in shape and expression at multiple scales within the
model. In a variational setting, our model samples diverse realistic 3D faces
from a multivariate Gaussian distribution. Our training data consists of 20,466
meshes of extreme expressions captured over 12 different subjects. Despite
limited training data, our trained model outperforms state-of-the-art face
models with 50% lower reconstruction error, while using 75% fewer parameters.
We also show that replacing the expression space of an existing
state-of-the-art face model with our autoencoder achieves a lower
reconstruction error. Our data, model and code are available at
http://github.com/anuragranj/com
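A minimal sketch of the spectral mesh convolution underlying such a model, using the standard Chebyshev-polynomial (ChebNet-style) formulation; initialization and layer sizes here are illustrative, not the released implementation:

    import torch
    import torch.nn as nn

    class ChebConv(nn.Module):
        """K-order Chebyshev spectral convolution on a mesh graph."""
        def __init__(self, in_ch, out_ch, K):
            super().__init__()
            self.weight = nn.Parameter(0.01 * torch.randn(K, in_ch, out_ch))
            self.bias = nn.Parameter(torch.zeros(out_ch))

        def forward(self, x, laplacian):
            # x: (num_verts, in_ch); laplacian: sparse (num_verts, num_verts),
            # rescaled so its eigenvalues lie in [-1, 1].
            out = x @ self.weight[0]                       # T_0(L) x = x
            Tx_prev, Tx = x, None
            if self.weight.shape[0] > 1:
                Tx = torch.sparse.mm(laplacian, x)         # T_1(L) x = L x
                out = out + Tx @ self.weight[1]
            for k in range(2, self.weight.shape[0]):
                Tx_next = 2 * torch.sparse.mm(laplacian, Tx) - Tx_prev
                out = out + Tx_next @ self.weight[k]       # Chebyshev recurrence
                Tx_prev, Tx = Tx, Tx_next
            return out + self.bias

In the autoencoder, layers of this kind are interleaved with the mesh down- and up-sampling operations to build the multi-scale hierarchy.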
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.
Comment: To appear in CVPR 2019
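A minimal sketch of this kind of subject-conditioned animation model, assuming per-frame audio features and a one-hot speaker label; all layer sizes and names are illustrative placeholders, not the released VOCA network:

    import torch
    import torch.nn as nn

    class SpeechToVertices(nn.Module):
        """Maps audio features plus a one-hot speaker label to per-vertex
        displacements added to a static template mesh."""
        def __init__(self, audio_dim=29, num_subjects=12, num_verts=5023):
            super().__init__()
            self.num_verts = num_verts
            self.net = nn.Sequential(
                nn.Linear(audio_dim + num_subjects, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, num_verts * 3),
            )

        def forward(self, audio_feat, subject_onehot, template):
            # Conditioning on the subject label is what lets one model hold
            # several speaking styles; at test time any label can be chosen.
            x = torch.cat([audio_feat, subject_onehot], dim=-1)
            offsets = self.net(x).view(-1, self.num_verts, 3)
            return template + offsets  # animated mesh, one per audio frame

Because identity is factored out into the template and the label, the same audio can drive any face shape, which is what makes application to unseen subjects possible without retargeting.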
Learning Human Optical Flow
The optical flow of humans is well known to be useful for the analysis of
human action. Given this, we devise an optical flow algorithm specifically for
human motion and show that it is superior to generic flow methods. Designing a
method by hand is impractical, so we develop a new training database of image
sequences with ground truth optical flow. For this we use a 3D model of the
human body and motion capture data to synthesize realistic flow fields. We then
train a convolutional neural network to estimate human flow fields from pairs
of images. Since many applications in human motion analysis depend on speed,
and we anticipate mobile applications, we base our method on SPyNet with
several modifications. We demonstrate that our trained network is more accurate
than a wide range of top methods on held-out test data and that it generalizes
well to real image sequences. When combined with a person detector/tracker, the
approach provides a full solution to the problem of 2D human flow estimation.
Both the code and the dataset are available for research.
Comment: British Machine Vision Conference 2018 (Oral)
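Accuracy on held-out flow data is conventionally reported as average endpoint error (EPE); a minimal sketch, where the optional mask for restricting evaluation to person pixels is our illustrative addition:

    import torch

    def average_epe(flow_pred, flow_gt, mask=None):
        """Mean Euclidean distance between predicted and true flow vectors.
        Tensors are (B, 2, H, W); mask optionally selects person pixels."""
        epe = torch.norm(flow_pred - flow_gt, p=2, dim=1)  # (B, H, W)
        return epe[mask].mean() if mask is not None else epe.mean()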
Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation
We address the unsupervised learning of several interconnected problems in
low-level vision: single view depth prediction, camera motion estimation,
optical flow, and segmentation of a video into the static scene and moving
regions. Our key insight is that these four fundamental vision problems are
coupled through geometric constraints. Consequently, learning to solve them
together simplifies the problem because the solutions can reinforce each other.
We go beyond previous work by exploiting geometry more explicitly and
segmenting the scene into static and moving regions. To that end, we introduce
Competitive Collaboration, a framework that facilitates the coordinated
training of multiple specialized neural networks to solve complex problems.
Competitive Collaboration works much like expectation-maximization, but with
neural networks that act as both competitors to explain pixels that correspond
to static or moving regions, and as collaborators through a moderator that
assigns pixels to be either static or independently moving. Our novel method
integrates all these problems in a common framework and simultaneously reasons
about the segmentation of the scene into moving objects and the static
background, the camera motion, depth of the static scene structure, and the
optical flow of moving objects. Our model is trained without any supervision
and achieves state-of-the-art performance among joint unsupervised methods on
all sub-problems.
Comment: CVPR 2019
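A minimal sketch of the competition step's objective, assuming photometric reconstructions from the two competitors and a soft moderator mask; the variable names are ours, not the paper's code:

    import torch

    def competition_loss(img, warped_static, warped_flow, mask_static):
        """Each competitor is charged only for the pixels the moderator's
        soft mask assigns to it: the static-scene networks (depth + camera)
        for mask values near 1, the flow network for values near 0."""
        err_static = (img - warped_static).abs().mean(dim=1)  # photometric error
        err_flow = (img - warped_flow).abs().mean(dim=1)
        return (mask_static * err_static
                + (1.0 - mask_static) * err_flow).mean()

    # Training alternates, much as in EM: (1) competition - fix the moderator
    # and train the competitors on their assigned pixels; (2) collaboration -
    # fix the competitors and train the moderator to assign each pixel to
    # whichever competitor explains it better.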
Hand Gesture Recognition System Using Histogram and Neural Network
In this paper we consider the problems posed by the distance between the hand and the webcam, and by the resulting image noise, in hand gesture recognition for human-computer interaction (HCI) using a webcam. We survey recent hand gesture recognition systems, presenting background information along with the key issues and major challenges of such systems, and we consider histogram and neural network approaches for hand detection. We conclude by reviewing different hand gesture approaches, algorithms, prototype models, technologies, and applications. Existing approaches can be broadly divided into data-glove based, computer-vision based, and drawing-gesture methods. Hand gestures are a method of non-verbal communication for human beings; using gesture applications, humans can interact with computers efficiently without any input devices.
DOI: 10.17762/ijritcc2321-8169.160413
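As one concrete instance of the histogram approach discussed above, here is a minimal OpenCV sketch of hue-saturation histogram back-projection for highlighting skin-colored hand regions; the bin counts and threshold are illustrative choices, not values from the paper:

    import cv2

    def hand_mask(frame_bgr, skin_patch_bgr):
        """Back-project a hue-saturation histogram of a sample skin patch
        onto the webcam frame to highlight skin-colored (hand) regions."""
        patch = cv2.cvtColor(skin_patch_bgr, cv2.COLOR_BGR2HSV)
        frame = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([patch], [0, 1], None, [30, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        prob = cv2.calcBackProject([frame], [0, 1], hist, [0, 180, 0, 256], 1)
        _, mask = cv2.threshold(prob, 50, 255, cv2.THRESH_BINARY)
        return mask  # binary candidate mask, e.g. input to a neural classifier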
NeuMan: Neural Human Radiance Field from a Single Video
Photorealistic rendering and reposing of humans is important for enabling
augmented reality experiences. We propose a novel framework to reconstruct the
human and the scene that can be rendered with novel human poses and views from
just a single in-the-wild video. Given a video captured by a moving camera, we
train two NeRF models: a human NeRF model and a scene NeRF model. To train
these models, we rely on existing methods to estimate the rough geometry of the
human and the scene. Those rough geometry estimates allow us to create a
warping field from the observation space to the canonical pose-independent
space, in which we train the human model. Our method is able to learn
subject-specific details, including cloth wrinkles and accessories, from just a
10-second video clip, and to provide high-quality renderings of the human under
novel poses, from novel views, together with the background.
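A minimal sketch of the final compositing step, assuming the human samples have already been warped to canonical space and queried, so that depth-sorted colors and densities from both fields render along one ray; this is standard NeRF volume rendering, not the paper's exact pipeline:

    import torch

    def render_ray(rgb, sigma, deltas):
        """Alpha-composite depth-sorted samples along each ray.
        rgb: (R, S, 3) colors, sigma: (R, S) densities, deltas: (R, S)
        spacings, with samples from the human and scene fields merged."""
        alpha = 1.0 - torch.exp(-sigma * deltas)             # per-sample opacity
        trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)   # transmittance
        trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
        weights = alpha * trans
        return (weights[..., None] * rgb).sum(dim=1)         # rendered pixel color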
FaceLit: Neural 3D Relightable Faces
We propose a generative framework, FaceLit, capable of generating a 3D face
that can be rendered at various user-defined lighting conditions and views,
learned purely from 2D images in-the-wild without any manual annotation. Unlike
existing works that require careful capture setup or human labor, we rely on
off-the-shelf pose and illumination estimators. With these estimates, we
incorporate the Phong reflectance model in the neural volume rendering
framework. Our model learns to generate shape and material properties of a face
such that, when rendered according to the natural statistics of pose and
illumination, it produces photorealistic face images with multiview 3D and
illumination consistency. Our method enables photorealistic generation of faces
with explicit illumination and view controls on multiple datasets - FFHQ,
MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D-aware
GANs on the FFHQ dataset, achieving an FID score of 3.5.
Comment: CVPR 2023
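A minimal sketch of the classic Phong reflectance that such a pipeline evaluates per 3D sample inside volume rendering; the coefficient values are placeholders, not FaceLit's learned quantities:

    import torch
    import torch.nn.functional as F

    def phong_radiance(normal, light_dir, view_dir, albedo,
                       k_a=0.3, k_d=0.7, k_s=0.1, shininess=16.0):
        """Classic Phong reflectance per sample: ambient + diffuse +
        specular terms under a single directional light."""
        n = F.normalize(normal, dim=-1)
        l = F.normalize(light_dir, dim=-1)
        v = F.normalize(view_dir, dim=-1)
        diff = (n * l).sum(-1, keepdim=True).clamp(min=0.0)   # Lambertian term
        r = 2.0 * diff * n - l                                # reflect l about n
        spec = (r * v).sum(-1, keepdim=True).clamp(min=0.0) ** shininess
        return albedo * (k_a + k_d * diff) + k_s * spec

Keeping lighting in this explicit analytic form is what makes illumination a user control at generation time rather than an entangled latent factor.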
MobileOne: An Improved One millisecond Mobile Backbone
Efficient neural network backbones for mobile devices are often optimized for
metrics such as FLOPs or parameter count. However, these metrics may not
correlate well with latency of the network when deployed on a mobile device.
Therefore, we perform extensive analysis of different metrics by deploying
several mobile-friendly networks on a mobile device. We identify and analyze
architectural and optimization bottlenecks in recent efficient neural networks
and provide ways to mitigate these bottlenecks. To this end, we design an
efficient backbone MobileOne, with variants achieving an inference time under 1
ms on an iPhone 12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne
achieves state-of-the-art performance within the efficient architectures while
being many times faster on mobile. Our best model obtains similar performance
on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3%
better top-1 accuracy on ImageNet than EfficientNet at similar latency.
Furthermore, we show that our model generalizes to multiple tasks - image
classification, object detection, and semantic segmentation with significant
improvements in latency and accuracy as compared to existing efficient
architectures when deployed on a mobile device. Code and models are available
at https://github.com/apple/ml-mobileone
Comment: Accepted at CVPR 2023
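The core mechanism behind such backbones is structural reparameterization: multi-branch blocks used at train time collapse into a single convolution for inference. A minimal PyTorch sketch of the basic step, folding a BatchNorm into its preceding convolution (our illustration, not the released MobileOne code):

    import torch
    import torch.nn as nn

    def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
        """Fold a BatchNorm into its preceding convolution so a train-time
        conv+BN branch becomes a single inference-time convolution."""
        fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                          conv.stride, conv.padding, groups=conv.groups, bias=True)
        scale = bn.weight / (bn.running_var + bn.eps).sqrt()  # per-channel scale
        fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
        base = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        fused.bias.data = (base - bn.running_mean) * scale + bn.bias
        return fused

Parallel branches fused this way can then be summed into a single kernel, so the deployed network runs as a plain feed-forward stack with no branching overhead.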